AITopics | entity recognition dataset

Collaborating Authors

entity recognition dataset

Information about AI from the News, Publications, and Conferences

Automatic Classification – Tagging and Summarization – Customizable Filtering and Analysis

If you are looking for an answer to the question What is Artificial Intelligence? and you only have a minute, then here's the definition the Association for the Advancement of Artificial Intelligence offers on its home page: "the scientific understanding of the mechanisms underlying thought and intelligent behavior and their embodiment in machines."

However, if you are fortunate enough to have more than a minute, then please get ready to embark upon an exciting journey exploring AI (but beware, it could last a lifetime) …

Reddit-Impacts: A Named Entity Recognition Dataset for Analyzing Clinical and Social Effects of Substance Use Derived from Social Media

Ge, Yao, Das, Sudeshna, O'Connor, Karen, Al-Garadi, Mohammed Ali, Gonzalez-Hernandez, Graciela, Sarker, Abeed

arXiv.org Artificial IntelligenceMay-9-2024

Substance use disorders (SUDs) are a growing concern globally, necessitating enhanced understanding of the problem and its trends through data-driven research. Social media are unique and important sources of information about SUDs, particularly since the data in such sources are often generated by people with lived experiences. In this paper, we introduce Reddit-Impacts, a challenging Named Entity Recognition (NER) dataset curated from subreddits dedicated to discussions on prescription and illicit opioids, as well as medications for opioid use disorder. The dataset specifically concentrates on the lesser-studied, yet critically important, aspects of substance use--its clinical and social impacts. We collected data from chosen subreddits using the publicly available Application Programming Interface for Reddit. We manually annotated text spans representing clinical and social impacts reported by people who also reported personal nonmedical use of substances including but not limited to opioids, stimulants and benzodiazepines. Our objective is to create a resource that can enable the development of systems that can automatically detect clinical and social impacts of substance use from text-based social media data. The successful development of such systems may enable us to better understand how nonmedical use of substances affects individual health and societal dynamics, aiding the development of effective public health strategies. In addition to creating the annotated data set, we applied several machine learning models to establish baseline performances. Specifically, we experimented with transformer models like BERT, and RoBERTa, one few-shot learning model DANN by leveraging the full training dataset, and GPT-3.5 by using one-shot learning, for automatic NER of clinical and social impacts. The dataset has been made available through the 2024 SMM4H shared tasks.

dataset, social impact, substance use, (12 more...)

arXiv.org Artificial Intelligence

2405.06145

Country:

North America > United States > Pennsylvania > Philadelphia County > Philadelphia (0.14)
North America > United States > Georgia > Fulton County > Atlanta (0.05)
North America > United States > Tennessee > Davidson County > Nashville (0.04)
North America > United States > California > Los Angeles County > Los Angeles > Hollywood > West Hollywood (0.04)

Genre: Research Report (0.82)

Industry:

Health & Medicine > Pharmaceuticals & Biotechnology (1.00)
Health & Medicine > Therapeutic Area > Psychiatry/Psychology > Addiction Disorder (0.89)

Technology:

Information Technology > Communications > Social Media (1.00)
Information Technology > Artificial Intelligence > Natural Language > Text Processing (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.70)

Add feedback

Developing a Named Entity Recognition Dataset for Tagalog

Miranda, Lester James V.

arXiv.org Artificial IntelligenceNov-13-2023

We present the development of a Named Entity Recognition (NER) dataset for Tagalog. This corpus helps fill the resource gap present in Philippine languages today, where NER resources are scarce. The texts were obtained from a pretraining corpora containing news reports, and were labeled by native speakers in an iterative fashion. The resulting dataset contains ~7.8k documents across three entity types: Person, Organization, and Location. The inter-annotator agreement, as measured by Cohen's $\kappa$, is 0.81. We also conducted extensive empirical evaluation of state-of-the-art methods across supervised and transfer learning settings. Finally, we released the data and processing code publicly to inspire future work on Tagalog NLP.

computational linguistic, tagalog, tlu nified -ner, (13 more...)

arXiv.org Artificial Intelligence

2311.07161

Country:

North America > United States > Minnesota > Hennepin County > Minneapolis (0.14)
Asia > Philippines > Luzon > National Capital Region > City of Manila (0.05)
Europe > France > Provence-Alpes-Côte d'Azur > Bouches-du-Rhône > Marseille (0.04)
(8 more...)

Genre: Research Report (1.00)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Text Processing (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.47)

Add feedback

CleanCoNLL: A Nearly Noise-Free Named Entity Recognition Dataset

Rücker, Susanna, Akbik, Alan

arXiv.org Artificial IntelligenceOct-24-2023

The CoNLL-03 corpus is arguably the most well-known and utilized benchmark dataset for named entity recognition (NER). However, prior works found significant numbers of annotation errors, incompleteness, and inconsistencies in the data. This poses challenges to objectively comparing NER approaches and analyzing their errors, as current state-of-the-art models achieve F1-scores that are comparable to or even exceed the estimated noise level in CoNLL-03. To address this issue, we present a comprehensive relabeling effort assisted by automatic consistency checking that corrects 7.0% of all labels in the English CoNLL-03. Our effort adds a layer of entity linking annotation both for better explainability of NER labels and as additional safeguard of annotation quality. Our experimental evaluation finds not only that state-of-the-art approaches reach significantly higher F1-scores (97.1%) on our data, but crucially that the share of correct predictions falsely counted as errors due to annotation noise drops from 47% to 6%. This indicates that our resource is well suited to analyze the remaining errors made by state-of-the-art models, and that the theoretical upper bound even on high resource, coarse-grained NER is not yet reached. To facilitate such analysis, we make CleanCoNLL publicly available to the research community.

cleanconll, entity recognition dataset, noise-free

arXiv.org Artificial Intelligence

2310.16225

Genre: Research Report > Promising Solution (0.73)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Text Processing (1.00)
Information Technology > Artificial Intelligence > Natural Language > Information Retrieval (1.00)

Add feedback

FiNER: Financial Named Entity Recognition Dataset and Weak-Supervision Model

Shah, Agam, Vithani, Ruchit, Gullapalli, Abhinav, Chava, Sudheer

arXiv.org Artificial IntelligenceFeb-22-2023

The development of annotated datasets over the 21st century has helped us truly realize the power of deep learning. Most of the datasets created for the named-entity-recognition (NER) task are not domain specific. Finance domain presents specific challenges to the NER task and a domain specific dataset would help push the boundaries of finance research. In our work, we develop the first high-quality NER dataset for the finance domain. To set the benchmark for the dataset, we develop and test a weak-supervision-based framework for the NER task. We extend the current weak-supervision framework to make it employable for span-level classification. Our weak-ner framework and the dataset are publicly available on GitHub and Hugging Face.

artificial intelligence, machine learning, natural language, (16 more...)

arXiv.org Artificial Intelligence

2302.11157

Country:

Asia > Taiwan > Taiwan Province > Taipei (0.05)
North America > United States > Texas (0.04)
Asia > India (0.04)
(6 more...)

Genre: Research Report (0.50)

Industry:

Government > Regional Government > North America Government > United States Government (0.93)
Banking & Finance > Trading (0.67)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Text Processing (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.89)

Add feedback

Recognizing and Extracting Cybersecurtity-relevant Entities from Text

Hanks, Casey, Maiden, Michael, Ranade, Priyanka, Finin, Tim, Joshi, Anupam

arXiv.org Artificial IntelligenceAug-2-2022

Cyber Threat Intelligence (CTI) is information describing threat vectors, vulnerabilities, and attacks and is often used as training data for AI-based cyber defense systems such as Cybersecurity Knowledge Graphs (CKG). There is a strong need to develop community-accessible datasets to train existing AI-based cybersecurity pipelines to efficiently and accurately extract meaningful insights from CTI. We have created an initial unstructured CTI corpus from a variety of open sources that we are using to train and test cybersecurity entity models using the spaCy framework and exploring self-learning methods to automatically recognize cybersecurity entities. We also describe methods to apply cybersecurity domain entity linking with existing world knowledge from Wikidata. Our future work will survey and test spaCy NLP tools and create methods for continuous integration of new information extracted from text.

annotation, machine learning, natural language, (16 more...)

arXiv.org Artificial Intelligence

2208.01693

Country:

North America > United States > Maryland > Baltimore (0.28)
North America > United States > Maryland > Baltimore County (0.14)
Asia > North Korea (0.04)

Genre: Research Report (0.40)

Industry:

Information Technology > Security & Privacy (1.00)
Government > Military > Cyberwarfare (1.00)

Technology:

Information Technology > Security & Privacy (1.00)
Information Technology > Communications > Social Media (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
(2 more...)

Add feedback